Content¶
1. Introduction
1.1 Author's Details
1.2 Importing the Libraries
1.3 Importing the Dataset
2. Data Preprocessing
2.1 Lemmatization and Tokenization
3. Data Cleaning
3.1 Remove stopwords, Remove symbols, Remove URLs
4. Converting text to vectors
4.1 Using CountVectorizer
4.2 Using TF-IDF
4.3 Using Word2Vec
4.4 Using GoogleNews Word2Vec
5. Applying Machine Learning Models
5.1 Logistic Regression
5.1.1 Logistic Regression with CountVectorizer
5.1.2 Logistic Regression with TFIDFVectorizer
5.1.3 Logistic Regression with Word2Vec
5.1.4 Logistic Regression with GoogleNews Word2Vec
5.2 SVM
5.2.1 SGD Classifier with CountVectorizer
5.2.2 SVM Classifier with TFIDFVectorizer
5.2.3 SGD Classifier & RandomizedSearchCV with Word2Vec
5.2.4 SGD Classifier with GoogleNews Word2Vec
5.3 Random Forest
5.3.1 Random Forest with CountVectorizer
5.3.2 Random Forest with TFIDFVectorizer
5.3.3 Random Forest with Word2Vec
5.3.4 Random Forest with GoogleNews Word2Vec
6. Visualizing the Results
6.1 Visualizing the results using bar plots
6.2 Visualizing the results using sunburst plots
7. Conclusion
1. Introduction¶
1.1. Author's Details¶
1.2. Importing the libraries¶
In [1]:
# Importing the libraries
import pandas as pd
1.3. Importing the dataset¶
In [2]:
# Importing the dataset
# Define the column names for the data
columns = ['target', 'ids', 'date', 'flag', 'user', 'text']
dataset = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1', names=columns)
df = pd.DataFrame(dataset)
df.head()
Out[2]:
|   | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
2. Data Preprocessing¶
In [3]:
# Preprocessing the data using NLTK
# Importing the libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
2.1. Lemmatization and Tokenization¶
In [4]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
# Defining a function to tokenize and lemmatize the text
def tokenize_and_lemmatize(text):
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return " ".join(lemmatized_tokens)
In [5]:
# Applying the function to the text column
df['text'] = df['text'].apply(tokenize_and_lemmatize)
3. Data Cleaning¶
3.1 Remove stopwords, Remove symbols, Remove URLs¶
In [6]:
# Data Cleansing: Remove stopwords, remove symbols, remove URLs
# Importing the libraries
import re
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
In [7]:
# Defining a function to clean the text
def clean_text(text):
# Remove URLs
text = re.sub(r'http\S+', '', text)
# Remove symbols and numbers
text = re.sub(r'[^\w\s]', '', text)
# Remove stopwords
text = " ".join([word for word in text.split() if word.lower() not in stop_words])
return text
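As a quick sanity check, the same cleaning logic can be run standalone on a sample tweet. The tiny hard-coded stopword set below is a stand-in for NLTK's full English list, so this snippet runs without downloading any corpora:

```python
import re

# Illustrative stand-in for nltk.corpus.stopwords.words('english')
stop_words = {"is", "the", "a", "at", "and", "on"}

def clean_text(text):
    text = re.sub(r'http\S+', '', text)   # remove URLs
    text = re.sub(r'[^\w\s]', '', text)   # remove symbols and punctuation
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

print(clean_text("Check this out: http://t.co/xyz - the demo is live!"))
# → Check this out demo live
```

The URL, punctuation, and stopwords are stripped while the surviving tokens keep their original order, matching the cleaned `text` column above.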
In [8]:
# Applying the clean text function to the text column
df['text'] = df['text'].apply(clean_text)
# Displaying the first 5 rows of the dataset
df.head()
Out[8]:
|   | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | switchfoot http twitpiccom2y1zl Awww bummer sh... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | upset ca nt update Facebook texting might cry ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | Kenichan dived many time ball Managed save 50 ... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | whole body feel itchy like fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | nationwideclass behaving mad ca nt see |
4. Converting text to vectors¶
4.1 Using CountVectorizer¶
In [9]:
# Using countvectorizer to convert text to vectors
from sklearn.feature_extraction.text import CountVectorizer
# Initialize CountVectorizer
count_vectorizer = CountVectorizer()
# Fit and transform the text column
count_vectors = count_vectorizer.fit_transform(df['text'])
# Print the shape and a small sample of the output
print("Shape of CountVectorizer output:", count_vectors.shape)
print("Sample of CountVectorizer output:\n", count_vectors[0:5])
Shape of CountVectorizer output: (1600000, 727739)
Sample of CountVectorizer output:
  (0, 595810)	1
  (0, 276316)	1
  (0, 644810)	1
  ...
  (4, 546259)	1
4.2 Using TF-IDF¶
In [10]:
# Using TF-IDF to convert text to vectors
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
# Fit and transform the text column
tfidf_vectors = tfidf_vectorizer.fit_transform(df['text'])
# Print the shape and a small sample of the output
print("Shape of TFIDFVectorizer output:", tfidf_vectors.shape)
print("Sample of TFIDFVectorizer output:\n", tfidf_vectors[0:5])
Shape of TFIDFVectorizer output: (1600000, 727739)
Sample of TFIDFVectorizer output:
  (0, 166506)	0.13054811111622572
  (0, 615155)	0.29393741545363766
  (0, 120949)	0.37401407169858575
  ...
  (4, 114698)	0.23242882940643794
4.3 Using Word2Vec¶
In [11]:
# Using word2vec to convert text to vectors
from gensim.models import Word2Vec
# Tokenize the text column
tokenized_text = df['text'].apply(lambda x: x.split())
# Train a Word2Vec model
word2vec_model = Word2Vec(tokenized_text, window=5, min_count=1, workers=4)
word2vec_model.train(tokenized_text, total_examples=len(tokenized_text), epochs=10)
# Create Word2Vec vectors for the text column
word2vec_vectors = tokenized_text.apply(lambda x: [word2vec_model.wv[word] for word in x])
# Print the shape and a small sample of the output
print("Sample of Word2Vec output:", word2vec_vectors.head())
Sample of Word2Vec output: 0    [[-0.3599776, 0.0058297394, -0.17818972, -0.03...
1    [[0.09955056, 3.993704, 2.0638819, -0.03611924...
2    [[0.079917975, -0.19466908, -0.09980759, 0.298...
3    [[1.3553879, 0.45565313, 0.022046551, -2.99064...
4    [[-0.013708479, 0.06021129, -0.03755882, 0.151...
Name: text, dtype: object
4.4 Using GoogleNews Word2Vec¶
In [12]:
# Using GoogleNews word2vec to convert text to vectors
from gensim.models import KeyedVectors
# Load the GoogleNews Word2Vec model
google_w2v_path = '/Users/nisargp/PycharmProjects/NLTK/GoogleNews-vectors-negative300.bin'
google_w2v_model = KeyedVectors.load_word2vec_format(google_w2v_path, binary=True)
# Create GoogleNews Word2Vec vectors for the text column
google_w2v_vectors = tokenized_text.apply(lambda x: [google_w2v_model[word] for word in x if word in google_w2v_model])
# Print the shape and a small sample of the output
print("Sample of GoogleNews Word2Vec output:", google_w2v_vectors.head())
Sample of GoogleNews Word2Vec output: 0    [[-0.27929688, 0.034179688, -0.051513672, 0.27...
1    [[0.14160156, 0.15039062, 0.28125, -0.18847656...
2    [[0.12402344, -0.03515625, -0.02722168, -0.124...
3    [[0.07519531, -0.018920898, -0.0053710938, 0.2...
4    [[0.01977539, 0.18359375, 0.00592041, 0.002380...
Name: text, dtype: object
Since the vectorization techniques produce different formats of data, we'll need to handle them separately. We'll start with Logistic Regression using CountVectorizer and TFIDFVectorizer.¶
5. Applying Machine Learning Models¶
5.1 Logistic Regression¶
5.1.1 Logistic Regression with CountVectorizer¶
In [13]:
# Logistic Regression with CountVectorizer
# Importing the libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(count_vectors, df['target'], test_size=0.2, random_state=42)
# Initialize and train the Logistic Regression model
lr_model_count = LogisticRegression(max_iter=1000)
lr_model_count.fit(X_train, y_train)
# Make predictions and print the classification report
y_pred_count = lr_model_count.predict(X_test)
print("Classification Report for Logistic Regression with CountVectorizer:\n", classification_report(y_test, y_pred_count))
Classification Report for Logistic Regression with CountVectorizer:
precision recall f1-score support
0 0.79 0.77 0.78 159494
4 0.78 0.80 0.79 160506
accuracy 0.79 320000
macro avg 0.79 0.79 0.79 320000
weighted avg 0.79 0.79 0.79 320000
5.1.2 Logistic Regression with TFIDFVectorizer¶
In [14]:
# Logistic Regression with TFIDFVectorizer
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tfidf_vectors, df['target'], test_size=0.2, random_state=42)
# Initialize and train the Logistic Regression model
lr_model_tfidf = LogisticRegression(max_iter=1000)
lr_model_tfidf.fit(X_train, y_train)
# Make predictions and print the classification report
y_pred_tfidf = lr_model_tfidf.predict(X_test)
print("Classification Report for Logistic Regression with TFIDFVectorizer:\n", classification_report(y_test, y_pred_tfidf))
Classification Report for Logistic Regression with TFIDFVectorizer:
precision recall f1-score support
0 0.80 0.76 0.78 159494
4 0.78 0.81 0.79 160506
accuracy 0.79 320000
macro avg 0.79 0.79 0.79 320000
weighted avg 0.79 0.79 0.79 320000
Word2Vec and GoogleNews Word2Vec represent each document as a list of per-token vectors, so their output needs a different format before it can be fed to the models.¶
We aggregate the token vectors into a single fixed-length vector per document; a common approach is to take the mean of all the vectors in each document.¶
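The averaging step can be sketched with toy embeddings. The `toy_wv` dict and `document_vector` helper below are illustrative stand-ins, not the notebook's trained model:

```python
import numpy as np

# Toy 3-dimensional "embeddings" standing in for a trained Word2Vec vocabulary
toy_wv = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 2.0, 0.0]),
}

def document_vector(words, wv, dim=3):
    # Average the vectors of in-vocabulary words; zero vector if none match
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(document_vector(["good", "movie", "unseen"], toy_wv))
# → [0.5 1.  1. ]
```

Out-of-vocabulary tokens are simply skipped, and an all-zero vector covers the edge case where no token is known, which is exactly the shape the classifiers below expect.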
5.1.3 Logistic Regression with Word2Vec¶
In [15]:
# Logistic Regression with Word2Vec
# Importing the libraries
import numpy as np
# Function to calculate the mean vector for each document
def mean_vector(words):
    # Filter out words that are not in the Word2Vec model's vocabulary
    valid_words = [word for word in words if word in word2vec_model.wv]
    if valid_words:
        # Calculate the mean vector for the valid words
        vectors = [word2vec_model.wv[word] for word in valid_words]
        return np.mean(vectors, axis=0)
    else:
        # Return a zero vector if no valid words are found
        return np.zeros(word2vec_model.vector_size)
In [16]:
# Apply the function to the tokenized text
word2vec_mean_vectors = np.array(tokenized_text.apply(mean_vector).tolist())
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(word2vec_mean_vectors, df['target'], test_size=0.2, random_state=42)
# Initialize and train the Logistic Regression model
lr_model_word2vec = LogisticRegression(max_iter=1000)
lr_model_word2vec.fit(X_train, y_train)
# Make predictions and print the classification report
y_pred_word2vec = lr_model_word2vec.predict(X_test)
print("Classification Report for Logistic Regression with Word2Vec:\n", classification_report(y_test, y_pred_word2vec))
Classification Report for Logistic Regression with Word2Vec:
precision recall f1-score support
0 0.73 0.72 0.73 159494
4 0.73 0.74 0.73 160506
accuracy 0.73 320000
macro avg 0.73 0.73 0.73 320000
weighted avg 0.73 0.73 0.73 320000
5.1.4 Logistic Regression with GoogleNews Word2Vec¶
In [17]:
# Function to calculate the mean vector for each document
def mean_vector_google_w2v(words):
    # Filter out words that are not in the GoogleNews Word2Vec model's vocabulary
    valid_words = [word for word in words if word in google_w2v_model]
    if valid_words:
        # Calculate the mean vector for the valid words
        vectors = [google_w2v_model[word] for word in valid_words]
        return np.mean(vectors, axis=0)
    else:
        # Return a zero vector if no valid words are found
        return np.zeros(google_w2v_model.vector_size)
In [18]:
# Apply the function to the tokenized text
google_w2v_mean_vectors = np.array(tokenized_text.apply(mean_vector_google_w2v).tolist())
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(google_w2v_mean_vectors, df['target'], test_size=0.2, random_state=42)
# Initialize and train the Logistic Regression model
lr_model_google_w2v = LogisticRegression(max_iter=1000)
lr_model_google_w2v.fit(X_train, y_train)
# Make predictions and print the classification report
y_pred_google_w2v = lr_model_google_w2v.predict(X_test)
print("Classification Report for Logistic Regression with GoogleNews Word2Vec:\n", classification_report(y_test, y_pred_google_w2v))
Classification Report for Logistic Regression with GoogleNews Word2Vec:
precision recall f1-score support
0 0.74 0.73 0.73 159494
4 0.73 0.74 0.74 160506
accuracy 0.73 320000
macro avg 0.73 0.73 0.73 320000
weighted avg 0.73 0.73 0.73 320000
5.2 SVM¶
CountVectorizer produces such a large, high-dimensional matrix that a standard SVM is difficult to train on a local machine or even Google Colab, so we use the SGD classifier (a linear SVM trained with stochastic gradient descent) instead.¶
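A further advantage of `SGDClassifier` is that it supports out-of-core learning via `partial_fit`, so the matrix never has to fit in memory all at once. A minimal sketch on synthetic data (the batch sizes and toy labels here are made up for illustration; the notebook itself fits in one call):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="hinge", random_state=0)
classes = np.array([0, 4])  # the dataset's two sentiment labels

# Stream mini-batches instead of materializing one huge matrix in memory
for _ in range(20):
    X_batch = rng.normal(size=(100, 10))
    y_batch = np.where(X_batch[:, 0] > 0, 4, 0)  # separable toy labels
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on fresh synthetic data drawn from the same toy distribution
X_val = rng.normal(size=(200, 10))
y_val = np.where(X_val[:, 0] > 0, 4, 0)
print(clf.score(X_val, y_val))
```

`classes` must be passed on the first `partial_fit` call so the model knows all labels up front, since any single batch might not contain both.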
5.2.1 SGD Classifier with CountVectorizer¶
In [23]:
# Importing the SGD Classifier
from sklearn.linear_model import SGDClassifier
# Split the data into training and testing sets and training the model
X_train, X_test, y_train, y_test = train_test_split(count_vectors, df['target'], test_size=0.2, random_state=42)
sgd_model_count = SGDClassifier(loss='hinge', max_iter=10000)
sgd_model_count.fit(X_train, y_train)
# Make predictions and printing the classification report
y_pred_count_sgd = sgd_model_count.predict(X_test)
print("Classification Report for SGD Classifier with CountVectorizer:\n", classification_report(y_test, y_pred_count_sgd))
Classification Report for SGD Classifier with CountVectorizer:
precision recall f1-score support
0 0.80 0.72 0.76 159494
4 0.75 0.82 0.78 160506
accuracy 0.77 320000
macro avg 0.78 0.77 0.77 320000
weighted avg 0.78 0.77 0.77 320000
In [24]:
# Check the distribution of the target values
target_distribution = df['target'].value_counts()
print(target_distribution)
target
0    800000
4    800000
Name: count, dtype: int64
5.2.2 SVM Classifier with TFIDFVectorizer¶
In [25]:
# Importing the libraries
from sklearn.svm import LinearSVC
# Split the TFIDFVectorizer data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tfidf_vectors, df['target'], test_size=0.2, random_state=42)
# Initialize LinearSVC model
linear_svc_model_tfidf = LinearSVC(max_iter=10000)
# Train the model
linear_svc_model_tfidf.fit(X_train, y_train)
# Make predictions on the test set
y_pred_tfidf_svc = linear_svc_model_tfidf.predict(X_test)
# Print the classification report
print("Classification Report for LinearSVC with TFIDFVectorizer:\n", classification_report(y_test, y_pred_tfidf_svc))
Classification Report for LinearSVC with TFIDFVectorizer:
precision recall f1-score support
0 0.79 0.77 0.78 159494
4 0.77 0.80 0.78 160506
accuracy 0.78 320000
macro avg 0.78 0.78 0.78 320000
weighted avg 0.78 0.78 0.78 320000
5.2.3 SGD Classifier & RandomizedSearchCV with Word2Vec¶
In [22]:
# Trying SGD Model for word2vec
# Importing the SGD Classifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV # Import RandomizedSearchCV
# Split the Word2Vec mean vectors into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(word2vec_mean_vectors, df['target'], test_size=0.2, random_state=42)
from sklearn.metrics import classification_report
# Hyperparameters to search (reduced grid)
param_grid = {
    'alpha': [1e-3, 1e-2],     # Reduced number of options
    'max_iter': [1000, 5000],  # Reduced number of options
    'penalty': ['l2'],         # Keeping only 'l2'
    'loss': ['hinge']          # Keeping only 'hinge'
}
# Create a RandomizedSearchCV object with the SGD Classifier (you can switch to GridSearchCV if desired)
# Here, n_iter is set to 4, so only 4 random combinations will be tried
random_search_word2vec_sgd = RandomizedSearchCV(SGDClassifier(), param_grid, n_iter=4, cv=5, verbose=1, n_jobs=-1)
# Fit the model to the training data
random_search_word2vec_sgd.fit(X_train, y_train)
# Print the best parameters and score
print("Best parameters found:", random_search_word2vec_sgd.best_params_)
# Make predictions on the test set using the best model
y_pred_word2vec_sgd_random = random_search_word2vec_sgd.predict(X_test)
# Print the classification report for the model with the best hyperparameters
print("Classification Report for SGD Classifier with Word2Vec (RandomizedSearchCV):\n", classification_report(y_test, y_pred_word2vec_sgd_random))
Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best parameters found: {'penalty': 'l2', 'max_iter': 5000, 'loss': 'hinge', 'alpha': 0.001}
Classification Report for SGD Classifier with Word2Vec (RandomizedSearchCV):
precision recall f1-score support
0 0.73 0.73 0.73 159494
4 0.73 0.73 0.73 160506
accuracy 0.73 320000
macro avg 0.73 0.73 0.73 320000
weighted avg 0.73 0.73 0.73 320000
5.2.4 SGD Classifier with GoogleNews Word2Vec¶
In [26]:
# Split the GoogleNews Word2Vec mean vectors into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(google_w2v_mean_vectors, df['target'], test_size=0.2, random_state=42)
# Initialize SGD Classifier
sgd_model_google_w2v = SGDClassifier(loss='hinge', max_iter=10000)
# Train the model
sgd_model_google_w2v.fit(X_train, y_train)
# Make predictions on the test set
y_pred_google_w2v_sgd = sgd_model_google_w2v.predict(X_test)
# Print the classification report
print("Classification Report for SGD Classifier with GoogleNews Word2Vec:\n", classification_report(y_test, y_pred_google_w2v_sgd))
Classification Report for SGD Classifier with GoogleNews Word2Vec:
precision recall f1-score support
0 0.74 0.71 0.73 159494
4 0.73 0.76 0.74 160506
accuracy 0.73 320000
macro avg 0.73 0.73 0.73 320000
weighted avg 0.73 0.73 0.73 320000
5.3 Random Forest¶
5.3.1 Random Forest with CountVectorizer¶
In [27]:
from sklearn.ensemble import RandomForestClassifier
# Split the CountVectorizer data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(count_vectors, df['target'], test_size=0.2, random_state=42)
# Initialize Random Forest Classifier with modified hyperparameters
rf_model_count = RandomForestClassifier(n_estimators=50, max_depth=10, max_features='sqrt', random_state=42)
# Train the model
rf_model_count.fit(X_train, y_train)
# Make predictions on the test set
y_pred_count_rf = rf_model_count.predict(X_test)
# Print the classification report
print("Classification Report for Random Forest with CountVectorizer:\n", classification_report(y_test, y_pred_count_rf))
Classification Report for Random Forest with CountVectorizer:
precision recall f1-score support
0 0.71 0.57 0.63 159494
4 0.64 0.76 0.70 160506
accuracy 0.67 320000
macro avg 0.67 0.67 0.66 320000
weighted avg 0.67 0.67 0.66 320000
5.3.2 Random Forest with TFIDFVectorizer¶
In [28]:
# Applying the Random Forest on Tf-IDF Vectorizer
# Split the TFIDFVectorizer data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tfidf_vectors, df['target'], test_size=0.2, random_state=42)
# Initialize Random Forest Classifier with reduced number of trees and limited tree depth
rf_model_tfidf = RandomForestClassifier(n_estimators=50, max_depth=10, max_features='sqrt', random_state=42)
# Train the model
rf_model_tfidf.fit(X_train, y_train)
# Make predictions on the test set
y_pred_tfidf_rf = rf_model_tfidf.predict(X_test)
# Print the classification report
print("Classification Report for Random Forest with TFIDFVectorizer:\n", classification_report(y_test, y_pred_tfidf_rf))
Classification Report for Random Forest with TFIDFVectorizer:
precision recall f1-score support
0 0.71 0.56 0.63 159494
4 0.64 0.77 0.70 160506
accuracy 0.67 320000
macro avg 0.67 0.67 0.66 320000
weighted avg 0.67 0.67 0.66 320000
5.3.3 Random Forest with Word2Vec¶
In [29]:
# Applying the Random Forest on Word2Vec
# Split the Word2Vec mean vectors into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(word2vec_mean_vectors, df['target'], test_size=0.2, random_state=42)
# Initialize Random Forest Classifier with reduced number of trees and limited tree depth
rf_model_word2vec = RandomForestClassifier(n_estimators=50, max_depth=10, max_features='sqrt', random_state=42)
# Train the model
rf_model_word2vec.fit(X_train, y_train)
# Make predictions on the test set
y_pred_word2vec_rf = rf_model_word2vec.predict(X_test)
# Print the classification report
print("Classification Report for Random Forest with Word2Vec:\n", classification_report(y_test, y_pred_word2vec_rf))
Classification Report for Random Forest with Word2Vec:
precision recall f1-score support
0 0.70 0.73 0.71 159494
4 0.72 0.69 0.70 160506
accuracy 0.71 320000
macro avg 0.71 0.71 0.71 320000
weighted avg 0.71 0.71 0.71 320000
5.3.4 Random Forest with GoogleNews Word2Vec¶
In [30]:
# Applying the Random Forest on GoogleNews Word2Vec
# Split the GoogleNews Word2Vec mean vectors into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(google_w2v_mean_vectors, df['target'], test_size=0.2, random_state=42)
# Initialize Random Forest Classifier with reduced number of trees and limited tree depth
rf_model_google_w2v = RandomForestClassifier(n_estimators=50, max_depth=10, max_features='sqrt', random_state=42)
# Train the model
rf_model_google_w2v.fit(X_train, y_train)
# Make predictions on the test set
y_pred_google_w2v_rf = rf_model_google_w2v.predict(X_test)
# Print the classification report
print("Classification Report for Random Forest with GoogleNews Word2Vec:\n", classification_report(y_test, y_pred_google_w2v_rf))
Classification Report for Random Forest with GoogleNews Word2Vec:
precision recall f1-score support
0 0.69 0.73 0.71 159494
4 0.72 0.68 0.70 160506
accuracy 0.70 320000
macro avg 0.71 0.70 0.70 320000
weighted avg 0.71 0.70 0.70 320000
6. Visualizing the Results¶
6.1 Visualizing the results using bar plots¶
In [31]:
from sklearn.metrics import accuracy_score
# Accuracy for each model/vectorizer pair. Every split above used the same
# random_state and row order, so y_test is identical across all of them.
accuracy_rf_count = accuracy_score(y_test, y_pred_count_rf)
accuracy_rf_tfidf = accuracy_score(y_test, y_pred_tfidf_rf)
accuracy_rf_word2vec = accuracy_score(y_test, y_pred_word2vec_rf)
accuracy_rf_google_w2v = accuracy_score(y_test, y_pred_google_w2v_rf)
accuracy_svc_count = accuracy_score(y_test,y_pred_count_sgd)
accuracy_svc_tfidf = accuracy_score(y_test, y_pred_tfidf_svc)
accuracy_svc_word2vec = accuracy_score(y_test, y_pred_word2vec_sgd_random)
accuracy_svc_google_w2v = accuracy_score(y_test, y_pred_google_w2v_sgd)
accuracy_lr_count = accuracy_score(y_test, y_pred_count)
accuracy_lr_tfidf = accuracy_score(y_test, y_pred_tfidf)
accuracy_lr_word2vec = accuracy_score(y_test, y_pred_word2vec)
accuracy_lr_google_w2v = accuracy_score(y_test, y_pred_google_w2v)
In [33]:
import plotly.express as px
import pandas as pd
import plotly.io as pio
pio.renderers.default='notebook'
# Prepare data as a list of dictionaries
data = [
{'Model': 'Random Forest', 'Vectorizer': 'CountVectorizer', 'Accuracy': accuracy_rf_count},
{'Model': 'Random Forest', 'Vectorizer': 'TFIDFVectorizer', 'Accuracy': accuracy_rf_tfidf},
{'Model': 'Random Forest', 'Vectorizer': 'Word2Vec', 'Accuracy': accuracy_rf_word2vec},
{'Model': 'Random Forest', 'Vectorizer': 'GoogleNews Word2Vec', 'Accuracy': accuracy_rf_google_w2v},
{'Model': 'SVM', 'Vectorizer': 'CountVectorizer', 'Accuracy': accuracy_svc_count},
{'Model': 'SVM', 'Vectorizer': 'TFIDFVectorizer', 'Accuracy': accuracy_svc_tfidf},
{'Model': 'SVM', 'Vectorizer': 'Word2Vec', 'Accuracy': accuracy_svc_word2vec},
{'Model': 'SVM', 'Vectorizer': 'GoogleNews Word2Vec', 'Accuracy': accuracy_svc_google_w2v},
{'Model': 'Logistic Regression', 'Vectorizer': 'CountVectorizer', 'Accuracy': accuracy_lr_count},
{'Model': 'Logistic Regression', 'Vectorizer': 'TFIDFVectorizer', 'Accuracy': accuracy_lr_tfidf},
{'Model': 'Logistic Regression', 'Vectorizer': 'Word2Vec', 'Accuracy': accuracy_lr_word2vec},
{'Model': 'Logistic Regression', 'Vectorizer': 'GoogleNews Word2Vec', 'Accuracy': accuracy_lr_google_w2v},
]
# Convert to DataFrame
df_plot = pd.DataFrame(data)
# Create a bar plot
fig = px.bar(df_plot, x='Model', y='Accuracy', color='Vectorizer', barmode='group',
title='Comparison of Accuracy for Different Models and Vectorizers')
# Show the plot
fig.show()
6.2 Visualizing the results using sunburst plots¶
In [34]:
import plotly.express as px
import pandas as pd
import plotly.io as pio
pio.renderers.default='notebook'
# Prepare data as a list of dictionaries
data = [
# Random Forest
{'Model': 'Random Forest', 'Vectorizer': 'CountVectorizer', 'Accuracy': accuracy_rf_count},
{'Model': 'Random Forest', 'Vectorizer': 'TFIDFVectorizer', 'Accuracy': accuracy_rf_tfidf},
{'Model': 'Random Forest', 'Vectorizer': 'Word2Vec', 'Accuracy': accuracy_rf_word2vec},
{'Model': 'Random Forest', 'Vectorizer': 'GoogleNews Word2Vec', 'Accuracy': accuracy_rf_google_w2v},
# SVM
{'Model': 'SVM', 'Vectorizer': 'CountVectorizer', 'Accuracy': accuracy_svc_count},
{'Model': 'SVM', 'Vectorizer': 'TFIDFVectorizer', 'Accuracy': accuracy_svc_tfidf},
{'Model': 'SVM', 'Vectorizer': 'Word2Vec', 'Accuracy': accuracy_svc_word2vec},
{'Model': 'SVM', 'Vectorizer': 'GoogleNews Word2Vec', 'Accuracy': accuracy_svc_google_w2v},
# Logistic Regression
{'Model': 'Logistic Regression', 'Vectorizer': 'CountVectorizer', 'Accuracy': accuracy_lr_count},
{'Model': 'Logistic Regression', 'Vectorizer': 'TFIDFVectorizer', 'Accuracy': accuracy_lr_tfidf},
{'Model': 'Logistic Regression', 'Vectorizer': 'Word2Vec', 'Accuracy': accuracy_lr_word2vec},
{'Model': 'Logistic Regression', 'Vectorizer': 'GoogleNews Word2Vec', 'Accuracy': accuracy_lr_google_w2v},
]
# Convert to DataFrame
df_plot = pd.DataFrame(data)
# Create a sunburst plot
fig = px.sunburst(df_plot, path=['Model', 'Vectorizer'], values='Accuracy',
title='Comparison of Accuracy for Different Models and Vectorizers',
color='Accuracy', # Color by accuracy
color_continuous_scale='viridis',template='simple_white')
# Show the plot
fig.show()
7. Conclusion¶
1. Overall Best Performing Model and Vectorizer:¶
Logistic Regression with either CountVectorizer or TFIDFVectorizer achieved the highest accuracy (0.79). Both combinations have nearly identical precision, recall, and F1-score for both classes (0 and 4).¶
2. Comparison Between Models:¶
Logistic Regression: Consistently performed well across all vectorization techniques, with the best results using CountVectorizer and TFIDFVectorizer.¶
SVM: Similar performance to Logistic Regression but slightly lower in accuracy. The best performance was observed with TFIDFVectorizer.¶
Random Forest: Generally lower performance compared to the other two models, particularly with CountVectorizer and TFIDFVectorizer. The computational efficiency was also a concern with this model.¶
3. Comparison Between Vectorizers:¶
CountVectorizer and TFIDFVectorizer: These two techniques provided the highest accuracy scores across all models, particularly with Logistic Regression.¶
Word2Vec and GoogleNews Word2Vec: These techniques resulted in lower accuracy scores compared to CountVectorizer and TFIDFVectorizer for all models.¶
4. Trade-offs and Considerations:¶
Accuracy vs. Computational Efficiency: Logistic Regression with CountVectorizer and TFIDFVectorizer achieved the highest accuracy while remaining cheap to train, whereas Random Forest was both less accurate and more computationally demanding.¶
Simplicity vs. Complexity: Logistic Regression and SVM provide a good balance between model complexity and performance, making them suitable choices for this particular task.¶
Final Recommendation:¶
Based on the analysis, Logistic Regression with CountVectorizer or TFIDFVectorizer is recommended as the best approach for this specific sentiment analysis task. These combinations provide the highest accuracy, balanced precision and recall, and are computationally efficient. The choice between CountVectorizer and TFIDFVectorizer can be made based on additional considerations such as interpretability or specific use cases, as their performance is nearly identical.¶